12 May, 2020

Introduction

South Korea during COVID-19

South Korea is one of the world’s most densely populated countries with 51.64 million people.

The first case of COVID-19 in the country was confirmed on 20 January 2020. Since then, there have been only 256 deaths caused by COVID-19.

Research questions

  • How has the epidemic evolved in South Korea?

  • Is there any correlation between the place of infection and severity of the disease?

  • Does gender or age predispose people to contracting the disease or to a more severe outcome?

  • Are there any characteristic features in the high-prevalence disease areas?

  • Can a prediction of the disease confirmation be made based on the city?

Materials and methods

COVID-19 dataset from Kaggle

Workflow and Structure of the project

Data cleaning & augmenting

  • Get rid of unnecessary data
  • Remove invalid data
  • Convert data into a tidy format

  • filter(), mutate()
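
The cleaning steps above can be sketched as follows. This is a minimal illustration and not the project's actual script: the tibble `raw`, its columns, and the `note` column are hypothetical stand-ins for the Kaggle tables.

```r
library(dplyr)

# Hypothetical toy data standing in for a raw patient table
raw <- tibble(
  patient_id = c(1, 2, 3, 4),
  sex        = c("male", "female", NA, "female"),
  birth_year = c(1964, 1987, 1990, NA),
  note       = c("", "", "", "")  # assumed to be an unneeded column
)

clean <- raw %>%
  select(-note) %>%                                # get rid of unnecessary data
  filter(!is.na(sex), !is.na(birth_year)) %>%      # remove invalid rows
  mutate(age = 2020 - birth_year)                  # derive a new variable
```

Only rows with both `sex` and `birth_year` present survive the filter, so `age` is always defined in `clean`.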

  • Join tibbles
  • Create new variables
  • Transform
  • Subset the data

  • full_join(), mutate(), unite()
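
A minimal sketch of the augmenting steps, assuming two toy tibbles that share a `patient_id` key. The column names and values are hypothetical; only the verbs come from the slide.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy tibbles standing in for the patient tables
patient_info <- tibble(
  patient_id = c(1, 2, 4),
  province   = c("Seoul", "Seoul", "Daegu"),
  city       = c("Gangseo-gu", "Jungnang-gu", "Nam-gu")
)

patient_route <- tibble(
  patient_id = c(1, 1, 2, 5),
  visit_type = c("airport", "hospital", "store", "church")
)

patient <- patient_info %>%
  unite("location", province, city, sep = "_") %>%   # merge two columns into one
  full_join(patient_route, by = "patient_id") %>%    # keep rows from both tables
  mutate(has_route = !is.na(visit_type))             # create a new variable
```

`full_join()` keeps patients that appear in only one of the two tables, filling the missing side with `NA`, which is why a flag such as `has_route` can be useful afterwards.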

The main data frames created cover:

  • patient (patient_info_df + patient_route_df)
  • case
  • city
  • time (time_df, time_age_df, time_province_df)
  • searchtrend

SeoulFloating and Weather were not used

Problem: our data set repeats the same name for some columns.

Solution: we added a prefix to the column names.
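
One way to add such a prefix before joining is `dplyr::rename_with()`. The sketch below is an assumption about how this could be done, not the project's code: the helper `add_prefix()`, the prefixes, and the toy columns are all hypothetical.

```r
library(dplyr)

# Hypothetical tables that both contain `province` and `city`
patient_info <- tibble(
  patient_id = c(1, 2),
  province   = c("Seoul", "Busan"),
  city       = c("Gangseo-gu", "Dongnae-gu")
)
patient_route <- tibble(
  patient_id = c(1, 2),
  province   = c("Incheon", "Busan"),
  city       = c("Jung-gu", "Dongnae-gu")
)

# Hypothetical helper: prefix every column except the join key
add_prefix <- function(df, prefix, key = "patient_id") {
  rename_with(df, ~ paste0(prefix, "_", .x), .cols = -all_of(key))
}

joined <- full_join(
  add_prefix(patient_info, "info"),
  add_prefix(patient_route, "route"),
  by = "patient_id"
)
```

Without the prefix, `full_join()` would disambiguate the clashing names with `.x` and `.y` suffixes, which are harder to trace back to their source table.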

Results

using nest()
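
The slides do not show how `nest()` was applied, so the following is only a generic sketch of the pattern: one row per group, with the group's data stored in a list column and summarised with `purrr`. The tibble `time_province` and its values are hypothetical.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical cumulative confirmed cases per province and day
time_province <- tibble(
  province  = rep(c("Seoul", "Daegu"), each = 3),
  day       = rep(1:3, times = 2),
  confirmed = c(1, 3, 6, 10, 40, 90)
)

nested <- time_province %>%
  nest(data = -province) %>%   # one row per province, data in a list column
  mutate(max_daily_increase = map_dbl(data, ~ max(diff(.x$confirmed))))
```

The same structure allows a model to be fitted per group by replacing the summary function inside `map()`.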

score_org score_pca
0.4254386 0.495614

ANN

loss accuracy
1.825541 0.547619

Shiny app

Conclusion and discussion

  • The number of confirmed cases is high compared to the number of deaths.

  • The epidemic shows one peak (until the beginning of April) and follows a logistic model.

  • There’s no correlation between the place of infection and severity of the disease

  • More men die, but more women are confirmed to be sick. Young people are driving the spread.

  • At least from the retrieved data, there is no strong difference.

  • People in their 70s and 80s have a higher fatality rate (as expected).

  • There are clusters of superspreaders, and a characteristic age range can be observed in each.

  • A higher disease prevalence can be observed in bigger cities and in those that have nursing homes. This is very different from the countryside, where the elderly population ratio and the ratio of elderly people living alone are associated with fewer cases.

  • Accuracy is just above 50%, which is better than random guessing with 4 classes (25%).

  • Performance is similar to that of k-means.
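
For reference, the logistic model mentioned in the conclusions has the form K / (1 + exp(-r (t - t0))), where K is the final size, r the growth rate, and t0 the inflection day. The sketch below only evaluates the curve; the parameter values are hypothetical and are not fitted to the data set.

```r
# Logistic growth curve: cumulative cases on day t
logistic <- function(t, K, r, t0) {
  K / (1 + exp(-r * (t - t0)))
}

# Hypothetical parameters chosen for illustration only
days  <- 0:100
cases <- logistic(days, K = 10000, r = 0.2, t0 = 40)

# Daily new cases peak around the inflection day t0
daily    <- diff(cases)
peak_day <- days[which.max(daily)]
```

With real data, the three parameters could be estimated with `nls()`, for example using the self-starting model `SSlogis()` from the stats package.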

Problems and solutions

Loading several packages masks functions that share a name.

  • Detach packages after each R script.
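
A minimal sketch of this approach, assuming each script records the search path at its start and detaches whatever it attached at its end. The helper name `detach_new_packages()` is hypothetical, and `splines` is used only because it ships with R.

```r
# Hypothetical helper: detach every package attached since `before`
detach_new_packages <- function(before) {
  attached <- grep("^package:", search(), value = TRUE)
  new_pkgs <- setdiff(attached, before)
  for (pkg in new_pkgs) {
    detach(pkg, character.only = TRUE)
  }
  invisible(new_pkgs)
}

before <- search()      # snapshot at the start of the script
library(splines)        # packages the script needs
# ... script body ...
detach_new_packages(before)  # clean up at the end of the script
```

An alternative that avoids masking altogether is to call functions with an explicit namespace, for example `dplyr::select()`.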

Superspreaders

Correlation matrix